Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR

Authors

M. Ali Basha Shaik

Amr El-Desoky Mousa

Ralf Schlüter

Hermann Ney

Abstract

German is a morphologically rich language with a high degree of word inflection, derivation and compounding. This leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities in large vocabulary continuous speech recognition (LVCSR) systems. One of the main challenges in German LVCSR is the recognition of OOV words. For this purpose, data-driven morphemes are used to provide higher lexical coverage. In addition, the probability estimates of a sub-lexical LM can be further improved using feature-rich LMs such as maximum entropy (MaxEnt) and class-based LMs. In this work, for a sub-lexical German LVCSR task, we investigate the use of multiple morpheme-level features as classes for building class-based LMs that are estimated using the state-of-the-art MaxEnt approach. Thus, the benefits of both MaxEnt LMs and traditional class-based LMs are effectively combined. Furthermore, we experiment with maximum a posteriori (MAP) adaptation of the MaxEnt class-based LMs. We show consistent reductions in both the OOV recognition error rate and the word error rate (WER) on a German LVCSR task from the Quaero project, compared to the traditional class-based and N-gram morpheme-based LMs.
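For orientation, the general shape of a conditional MaxEnt LM over morphemes can be written as below. This is a textbook-style sketch, not necessarily the exact formulation used in the paper: the symbols m (predicted morpheme), h (history), f_i (binary features, which could fire on morpheme n-grams or on n-grams of morpheme classes), lambda_i (feature weights), lambda_i^0 (weights of a background model), sigma_i (prior variances) and D_adapt (adaptation data) are all notation assumed here for illustration. The second line shows one common way MAP adaptation is expressed for such models, namely as a Gaussian prior centred on the background weights.

```latex
% Conditional MaxEnt LM (assumed notation; f_i may be morpheme- or class-level features)
p_{\Lambda}(m \mid h) =
  \frac{\exp\left( \sum_{i} \lambda_i \, f_i(h, m) \right)}
       {\sum_{m'} \exp\left( \sum_{i} \lambda_i \, f_i(h, m') \right)}

% MAP adaptation with a Gaussian prior centred on background weights \lambda_i^{0}
% (a common formulation, assumed here rather than taken from the paper)
\hat{\Lambda} = \arg\max_{\Lambda}
  \sum_{(h, m) \in D_{\mathrm{adapt}}} \log p_{\Lambda}(m \mid h)
  \; - \; \sum_{i} \frac{\left( \lambda_i - \lambda_i^{0} \right)^{2}}{2 \sigma_i^{2}}
```

Under this reading, class-based information enters only through the choice of features f_i, so morpheme-level and class-level evidence share a single normalised model instead of being interpolated afterwards.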


Related resources

Investigation on language modelling approaches for open vocabulary speech recognition

By definition, words that are not present in a recognition vocabulary are called out-of-vocabulary (OOV) words. Recognition of unseen or new words is an important feature that is always desired in any real-world large vocabulary continuous speech recognition (LVCSR) system. However, human languages are complex in nature due to wide varieties of morphological richness such as inflections, deriva...


Investigation of Maximum Entropy Hybrid Language Models for Open Vocabulary German and Polish LVCSR

For languages like German and Polish, higher numbers of word inflections lead to high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Thus, one of the main challenges in large vocabulary continuous speech recognition (LVCSR) is recognizing an open vocabulary. In this paper, we investigate the use of mixed type of sub-word units in the same recognition lexicon. Namely, m...


Morpheme Level Feature-based Language Models for German LVCSR

One of the challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) of German is its complex morphology and high level of compounding. It leads to high Out-of-vocabulary (OOV) rates, and poor Language Model (LM) probabilities. In such cases, building LMs on morpheme level can be considered a better choice. Thereby, higher lexical coverage and lower LM perplexities are achieved. On ...


Hybrid Language Models Using Mixed Types of Sub-Lexical Units for Open Vocabulary German LVCSR

German is a highly inflected language with a large number of words derived from the same root. It makes use of a high degree of word compounding leading to high Out-of-vocabulary (OOV) rates, and Language Model (LM) perplexities. For such languages the use of sub-lexical units for Large Vocabulary Continuous Speech Recognition (LVCSR) becomes a natural choice. In this paper, we investigate the ...


The RWTH Aachen German and English LVCSR systems for IWSLT-2013

In this paper, German and English large vocabulary continuous speech recognition (LVCSR) systems developed by the RWTH Aachen University for the IWSLT-2013 evaluation campaign are presented. Good improvements are obtained with state-of-the-art monolingual and multilingual bottleneck features. In addition, an open vocabulary approach using morphemic sub-lexical units is investigated along with t...


Publication date: 2013
